Homework 4

Dimensionality Reduction

Author: Yichen Xie

NetID : yx2606

In this homework, you will be using the FIFA 2022 Dataset, which is a .csv where each row is a player in the FIFA 2022 video game. Each player is described by a variety of attributes, like crossing ability, stamina, etc. Each attribute is described by a "grade". The data contains the 500 most valuable players in the game.

The list of attributes is ['Crossing', 'Finishing', 'HeadingAccuracy', 'ShortPassing', 'Dribbling', 'FKAccuracy', 'LongPassing', 'BallControl', 'Acceleration', 'SprintSpeed', 'Reactions', 'ShotPower', 'Stamina', 'Strength', 'LongShots', 'Aggression', 'Penalties', 'StandingTackle', 'GKDiving', 'GKHandling', 'GKKicking', 'GKPositioning', 'GKReflexes']. Consider these the features. You will need to use these for PCA and t-SNE.

Start by uploading the fifa.csv to the notebook. Then run the code below, which downloads the player photos.

If there is code that requires a random seed (for example, t-SNE) please set it to 2022.

Step 1: Preprocessing

Create a new variable, Position, and group the following positions (found in the Best Position feature) together:

LB, RB, LWB, RWB - Wing Back

RW, LW, RM, LM - Winger

CAM, CDM, CM - Central Midfielder

CF, ST - Striker

CB - Central Defender

GK - Goalkeeper
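The grouping above can be sketched as a dictionary lookup with pandas. This is a minimal sketch, not the required solution: the column name "Best Position" is taken from the wording above and may be spelled differently in fifa.csv, and the helper add_position_group and the three-row demo frame are made up for illustration.

```python
import pandas as pd

# Mapping from specific position codes to the broader groups listed above.
POSITION_GROUPS = {
    "LB": "Wing Back", "RB": "Wing Back", "LWB": "Wing Back", "RWB": "Wing Back",
    "RW": "Winger", "LW": "Winger", "RM": "Winger", "LM": "Winger",
    "CAM": "Central Midfielder", "CDM": "Central Midfielder", "CM": "Central Midfielder",
    "CF": "Striker", "ST": "Striker",
    "CB": "Central Defender",
    "GK": "Goalkeeper",
}


def add_position_group(df, source_col="Best Position"):
    """Return a copy of df with a new 'Position' column holding the group.

    source_col is an assumption about the column name in fifa.csv.
    """
    out = df.copy()
    out["Position"] = out[source_col].map(POSITION_GROUPS)
    return out


# Tiny made-up frame standing in for fifa.csv (names are placeholders).
demo = pd.DataFrame({"Name": ["A", "B", "C"], "Best Position": ["LWB", "ST", "GK"]})
demo = add_position_group(demo)
print(demo)
```

Any position code missing from the dictionary would come out as NaN, which makes unmapped values easy to spot.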

Below is a diagram of the soccer positions and their groupings.

soccer.png

Step 2.1: PCA

Using the list of attributes above, reduce the dimensionality of the data using 2 principal components.
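As a rough sketch of the reduction step, the snippet below runs sklearn's PCA with 2 components. It uses randomly generated grades for four of the attributes as a stand-in for fifa.csv, so the numbers themselves are meaningless; only the sequence of calls is the point.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Stand-in for the real data: random "grades" for a few of the attributes.
rng = np.random.default_rng(2022)
features = ["Crossing", "Finishing", "Stamina", "GKDiving"]
df = pd.DataFrame(rng.integers(1, 100, size=(50, len(features))), columns=features)

# Project the feature columns onto the first two principal components.
pca = PCA(n_components=2, random_state=2022)
components = pca.fit_transform(df[features])

# Keep the projections next to the original rows for plotting later.
df["PC1"] = components[:, 0]
df["PC2"] = components[:, 1]

# Share of the total variance captured by each of the two components.
print(pca.explained_variance_ratio_)
```

The assignment does not ask for standardization and all attributes are grades on a similar scale, so none is applied here; scaling the features first would be a separate design choice.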

Plot the first and second components using altair. Color each point based on its grouping from Step 1, and make sure the tooltip for each point contains the player name, position and photo (use the image feature created in the first two code blocks).

What groupings of players are most directly visible?

The Goalkeepers are the most directly visible in this graph.

Step 2.2: PCA (no Goalkeepers)

Now, remove the goalkeepers from the data set, and re-run PCA. What does the first principal component seem to indicate? (As you scan over the range of the 1st principal component, what relationship do you see?)

The first principal component seems to indicate defensive ability, because the central defenders and wing backs appear on the right side.
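One way to check an interpretation like this is to inspect the loadings of the first component, i.e. how strongly each attribute contributes to it. The sketch below uses made-up players whose attributes are driven by a hidden "defensive" trait; the data, the trait and the three chosen attributes are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(2022)
n = 60

# Made-up players: one hidden "defensive" trait pushes two attributes up
# and one attribute down.
defensive = rng.normal(size=n)
df = pd.DataFrame({
    "StandingTackle": 60 + 15 * defensive + rng.normal(scale=3, size=n),
    "Strength": 65 + 10 * defensive + rng.normal(scale=3, size=n),
    "Finishing": 60 - 15 * defensive + rng.normal(scale=3, size=n),
    "Position": rng.choice(["Central Defender", "Striker", "Goalkeeper"], size=n),
})
features = ["StandingTackle", "Strength", "Finishing"]

# Drop the goalkeepers, then fit PCA on the remaining players.
outfield = df[df["Position"] != "Goalkeeper"]
pca = PCA(n_components=2).fit(outfield[features])

# Loadings of the first component: one weight per attribute.
loadings = pd.Series(pca.components_[0], index=features).sort_values()
print(loadings)
```

Attributes with large loadings of opposite sign pull players toward opposite ends of the first component. The overall sign of a component is arbitrary, so only the relative signs matter.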

Step 3.1: t-SNE

Using the t-SNE function from sklearn, plot the results of using 2 components with the rest of the parameters set to their defaults. Set the random state to 2022.

Be sure to plot in the same way as you did in 2.1 and 2.2 (using the same color indications, tooltips, etc.)
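A minimal sketch of the t-SNE call, using sklearn.manifold.TSNE with default parameters apart from the seed. The two synthetic blobs stand in for the attribute matrix and are an assumption for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(2022)

# Synthetic stand-in for the attribute matrix: two blobs of 40 points each
# in 5 dimensions.
X = np.vstack([
    rng.normal(0, 1, size=(40, 5)),
    rng.normal(8, 1, size=(40, 5)),
])

# Defaults everywhere except the seed, which is fixed for reproducibility.
tsne = TSNE(n_components=2, random_state=2022)
embedding = tsne.fit_transform(X)

print(embedding.shape)
```

The two columns of embedding can be attached to the data frame and plotted with the same color and tooltip encodings as the PCA charts.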

What relationships do you see in the t-SNE output?

The first component of t-SNE most likely measures attacking ability, since the attackers (Striker, Winger, etc.) are on the left side of this graph.

The second component of t-SNE in the first graph most likely measures position on the field. Players who stay at the front of the field have lower values, while those who stay at the back have higher values.

Step 3.2: t-SNE (only central midfielders)

Now, do the same as above but only for central midfielders. This time, instead of indicating the broad position group in your chart (which would be "Midfielders" in this case), indicate the specific position through color, like "CAM", "CDM" and so on.

What relationships do you see in the t-SNE output?

The CMs sit between the CAMs and the CDMs.

I think the second component reflects defensive ability, while the meaning of the first component is less obvious.

Step 3.3: t-SNE Parameters

Read this Distill on t-SNE parameters.

The perplexity parameter is described in the Distill as

"...which says (loosely) how to balance attention between local and global aspects of your data. The parameter is, in a sense, a guess about the number of close neighbors each point has."

Create two plots, one where you select a perplexity such that the points appear more tightly clustered (many small, tight clusters), and another where they are less tightly clustered (larger, less clear clusters). The default perplexity is 30, so consider this the baseline. Be sure to set the random seed to 2022 and use all of the data.
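The comparison can be sketched by fitting t-SNE once per perplexity value with the same seed. The values 5 and 50 are only illustrative choices on either side of the default of 30, and the random matrix is a stand-in for the real attributes; which values produce the clearest contrast on the FIFA data has to be found by trying a few.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(2022)

# Stand-in for the attribute matrix (120 made-up players, 6 attributes).
X = rng.normal(size=(120, 6))

embeddings = {}
for perplexity in (5, 50):  # illustrative values below and above the default
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=2022)
    embeddings[perplexity] = tsne.fit_transform(X)

for perplexity, emb in embeddings.items():
    print(perplexity, emb.shape)
```

Note that sklearn requires the perplexity to be smaller than the number of samples, which leaves plenty of room with 500 players.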

Step 3.4 t-SNE Questions

Answer the following questions:

(1) What do cluster sizes mean in t-SNE (e.g., one cluster with a large standard deviation vs. another with a tighter distribution)?

(2) Do distances between clusters or points mean something?

(3) What are some advantages of t-SNE over PCA?

(4) What are some disadvantages of t-SNE compared to PCA?

Answers:

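(1) A smaller cluster means that the points in it have smaller distances in the probability space, which suggests a tighter distribution. However, as the Distill article points out, t-SNE adapts its notion of distance to the local density of the data, so it tends to expand dense clusters and contract sparse ones. Relative cluster sizes in the plot are therefore not a reliable guide to spread in the original space.

(2) The distance between points in t-SNE reflects the similarity probabilities that are derived from their Euclidean distances in the original space. If two points are close together, they are probably more similar to each other than to points farther away. The same holds for clusters only loosely: per the Distill article, distances between well-separated clusters depend heavily on the perplexity and may not mean anything.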

(3) t-SNE usually gives better visualization results than PCA. It is more likely to produce clearly separated clusters because it focuses on the local structure of high-dimensional data.

(4) t-SNE is slow and computationally expensive on large datasets. Its parameters (e.g. distance, perplexity) are also not as easily interpretable as those of PCA.
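The mapping from Euclidean distance to probability mentioned in answer (2) can be illustrated with a small numpy sketch. It computes Gaussian conditional similarities with a single fixed bandwidth sigma, which is a simplification: t-SNE itself tunes a separate bandwidth for each point so that the perplexity matches the requested value. The function name and the three example points are made up.

```python
import numpy as np


def conditional_similarities(X, sigma=1.0):
    """Gaussian similarity of point j to point i, normalised over all j != i.

    Simplified: one shared sigma instead of a per-point bandwidth.
    """
    # Squared Euclidean distance between every pair of rows.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logits = -sq_dists / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)            # a point is not its own neighbour
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


# Three points on a line: the first two are close, the third is far away.
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
P = conditional_similarities(X)
print(np.round(P, 3))
```

Each row sums to 1, and nearby points receive most of the probability mass, which is why small distances in the plot suggest similarity.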